WEBVTT 00:00.000 --> 00:04.520 speaking on the question that I think is actually the best-named title of ShmooCon. 00:04.520 --> 00:11.360 Who's got time for that? Looking for advisories and blogs for threat data. 00:12.160 --> 00:16.040 Thanks for that introduction and for the shout out to the title. 00:16.040 --> 00:22.480 Hi everyone, I'm Shimon. I'm Elvis Hovor. And we are both security researchers 00:22.480 --> 00:28.280 working for Accenture Technology Labs. It's the R&D arm of Accenture, if 00:28.280 --> 00:35.200 you've heard of the large consulting firm. And who are we? So both of us are 00:35.200 --> 00:39.440 security researchers. We also have a research team that's focused on natural 00:39.440 --> 00:43.880 language processing. The core mission of Accenture Technology Labs as a whole is 00:43.880 --> 00:50.360 to address essentially challenging problems within certain domains. Elvis 00:50.360 --> 00:54.440 and I focus on security and because of the wealth of expertise there is, we also 00:54.440 --> 00:59.200 do a lot of cross-domain research. So we're essentially working on different 00:59.200 --> 01:03.760 research projects and seeing if we can address any real-world problems. So one 01:03.760 --> 01:07.840 of the things that we have been focused on over the last few months is the 01:07.840 --> 01:12.880 notion of how do you operationalize threat data. And I will not use the word 01:12.880 --> 01:17.040 threat intelligence because one thing I want to stay away from is at least 01:17.040 --> 01:21.120 giving any false notion that we are trying to tackle the large problem of 01:21.120 --> 01:25.080 threat intelligence. We're still focused on threat data. And what we 01:25.080 --> 01:29.880 want to talk about is the techniques that we are researching to bring to 01:29.880 --> 01:35.600 bear, from the NLP side, the natural language processing side, 01:35.600 --> 01:45.320 on threat data extraction within the security domain. Right, so I just want to 01:45.320 --> 01:49.280 create a frame of reference around the problems that we want to talk about. And 01:49.280 --> 01:54.720 when we talk about the challenge space, again, the focus is on threat data itself. 01:54.720 --> 01:57.640 Right, when we start talking about threat intelligence, there are a lot of connotations. 01:57.640 --> 02:02.120 We're not talking about advisories, we're not talking about indicators, but the actual 02:02.120 --> 02:06.520 data that... How many of you are threat analysts in the room here or spend a lot 02:06.520 --> 02:11.400 of time looking at threat data? Right, it's you folks who are turning the data 02:11.400 --> 02:15.920 into intelligence. What we really want to focus on here is how do we 02:15.920 --> 02:19.840 put the right kind of tools in place that can make your life easier, make you more 02:19.840 --> 02:24.120 efficient. The human is not going out of the loop. The human will stay in the loop 02:24.120 --> 02:28.960 no matter what we say. So let's try and address challenges that humans are 02:28.960 --> 02:33.360 facing today. What are the kind of desired functionalities, right, as we 02:33.360 --> 02:37.520 started looking at the challenge space? There is some low-hanging fruit and 02:37.520 --> 02:43.520 there are some pie in the sky, blue sky ideas. What is the entire spectrum of 02:43.520 --> 02:47.000 those ideas and which are the ones that we picked that we felt were at least 02:47.000 --> 02:52.720 addressable in the near term?
The framework behind how we are 02:52.720 --> 02:57.360 cross-pollinating ideas from natural language processing and information 02:57.360 --> 03:02.920 extraction into the security domain itself. I particularly haven't come 03:02.920 --> 03:05.960 across too much research within the space and I'll be interested in 03:05.960 --> 03:10.880 hearing more about that at the end, if there are any ideas or if you've seen 03:10.880 --> 03:18.040 other research work that's similar. The methodology and results. So this is an 03:18.040 --> 03:22.480 experimental framework. Natural language processing; underlying that is machine 03:22.480 --> 03:27.240 learning. It is a continuous iterative process in terms of improving, 03:27.240 --> 03:31.960 leveraging the human intelligence and human expertise to make the machine part 03:31.960 --> 03:36.120 more intelligent. So it is an iterative process. We are going through 03:36.120 --> 03:42.800 this in terms of it is a work in progress, right, so the results that you see are 03:42.800 --> 03:48.800 real results and some of them good, some of them not so hot. So we'll be 03:48.800 --> 03:53.640 pointing out the kind of challenges that we are seeing. Don't want to just talk 03:53.640 --> 03:58.120 about this. This is not supposed to be a pedantic presentation. So what is it 03:58.120 --> 04:03.320 that we have put together that really brings this idea to life and what do we 04:03.320 --> 04:09.200 see going forward in terms of how do you make these improvements, and again would 04:09.200 --> 04:14.900 love to hear feedback and opinions around what can be done based on the 04:14.900 --> 04:21.720 kind of things that you're seeing in your day-to-day lives. All right, so as 04:21.720 --> 04:25.920 you read through this, I'm not going to drain the slide, this should 04:25.920 --> 04:31.440 resonate with those of you who do spend even a little bit of time trying to 04:31.440 --> 04:36.440 study or trying to look through threat data on a day-to-day basis. To me 04:36.440 --> 04:43.360 it's the 80-20 problem of human analysis. Instead of spending 20% of your 04:43.360 --> 04:49.120 effort to get 80% of the results, it's kind of the other way around, and correct 04:49.120 --> 04:54.520 me if I'm wrong, right, how many of you spend a lot of time pre-processing the 04:54.520 --> 04:58.560 advisories, the blogs, the data to actually put it into some kind of a 04:58.560 --> 05:03.800 usable format? Those are essentially pre-processing techniques that can be 05:03.800 --> 05:10.520 easily replicated using computing technology, right. So what we 05:10.520 --> 05:16.480 want to try and do is address the core issue of 05:16.480 --> 05:21.000 leveraging humans to do things that they are actually good at and not wasting 05:21.000 --> 05:27.200 their time on techniques or processes that can actually be automated. Humans 05:27.200 --> 05:32.440 don't scale, right. There are only so many experts and analysts that 05:32.440 --> 05:36.480 we have. Some of the largest threat teams that I've seen are three or four 05:36.480 --> 05:41.920 people. If they are spending most of their time trying to read through documents 05:41.920 --> 05:47.200 then things are bound to fall through the cracks. Again, relying on human memory to 05:47.200 --> 05:54.200 be able to detect or remember and correlate insights across several 05:54.200 --> 06:01.080 documents is another massive waste of time and energy and effort on 06:01.080 --> 06:06.600 behalf of these analysts.
How many of you use spreadsheets to convert data from 06:06.600 --> 06:13.160 PDFs, text documents, blogs and use that as a continual repository? No 06:13.160 --> 06:16.480 one? All right, well that's good. It seems like the state of the art has 06:16.480 --> 06:22.640 moved on. But it is the status quo right now. We are still focused on using 06:22.640 --> 06:27.040 humans where they are not efficient, or rather they shouldn't be doing some 06:27.040 --> 06:33.760 of those things. So how do we use computing technology to move the dial 06:33.760 --> 06:41.440 towards that 80-20 rule, right. And as you look through this, these are the 06:41.440 --> 06:46.920 kind of insights that we came across as we spoke to teams that we support and as 06:46.920 --> 06:54.280 security researchers ourselves: what can we do with threat data, or at 06:54.280 --> 06:57.640 least the problems related to threat data, that has been solved in other 06:57.640 --> 07:02.480 domains? Natural language processing was one of the first things that came 07:02.480 --> 07:07.960 to mind. Being able to automate the parsing, or even semi-automate the 07:07.960 --> 07:13.840 parsing, of data into some kind of output that is semi-structured. That 07:13.840 --> 07:20.600 itself gets you much closer to being able to use at least some semi- 07:20.600 --> 07:24.840 automated techniques for analysis of the data. Being able to share that data, right. 07:24.840 --> 07:29.400 You don't want to still go through the process of picking up the phone or sending an email, 07:29.400 --> 07:34.040 which is again in some kind of an unstructured format, expecting the 07:34.040 --> 07:38.320 person on the receiving end to parse through it and understand what that 07:38.320 --> 07:43.600 format means. So getting to a common understanding is something that we can 07:43.600 --> 07:50.560 get to using natural language processing techniques. So as we 07:50.560 --> 07:56.240 started talking to the natural language processing experts that we have 07:56.240 --> 08:01.920 within Accenture Technology Labs it became clear that they have 08:01.920 --> 08:07.600 been working on addressing this problem in other domains. So we figured how do we 08:07.600 --> 08:13.480 bring security expertise, layer that with natural language processing, and try and 08:13.480 --> 08:17.680 at least address this problem. We're not going to find the 08:17.680 --> 08:21.680 silver bullet here. That was never the intent, but even if we can 08:21.680 --> 08:25.800 increase the efficiency of human analysts by even 10% I think we are 08:25.800 --> 08:29.240 taking a step in the right direction. 08:31.480 --> 08:37.560 So again I'm not going to drain the slide. It's a pretty simple overview of 08:37.560 --> 08:44.560 the framework that we are using to create this solution for a research 08:44.560 --> 08:48.960 project, but there are a couple of things that I do want to focus on. First one is 08:48.960 --> 08:52.840 the information extraction module. I'll be going through exactly why we picked a 08:52.840 --> 08:56.440 specific application within natural language processing, which is information 08:56.440 --> 09:02.800 extraction. The other is standardization. One of the objectives that we 09:02.800 --> 09:08.320 set out to achieve was converting this unstructured or semi-structured data into 09:08.320 --> 09:14.160 a common understanding or commonly understood format. Why STIX?
Well I'm not 09:14.160 --> 09:18.560 going to go into the justification and rationale behind STIX itself. 09:18.560 --> 09:24.760 It's a standardization effort. Is Sean Barnum here? No. He had a 09:24.760 --> 09:29.800 great talk last year about how standardization efforts help with 09:29.800 --> 09:35.600 sharing insights. The guys at MITRE are doing some really great work. So again 09:35.600 --> 09:39.720 the notion was we need to get to a standardized format, and we went with 09:39.720 --> 09:47.160 STIX. We also, instead of storing the data in just a traditional 09:47.160 --> 09:53.200 relational database, decided to go with a graph database, and as 09:53.200 --> 09:58.960 you look at how threat data itself is represented, it is multi-dimensional. 09:58.960 --> 10:04.520 There are lots of relationships, and being able to maintain these relationships 10:04.520 --> 10:10.360 in a relational database, we found out very quickly, was going to be a task 10:10.360 --> 10:16.240 not worth pursuing. Graph databases lend themselves really well 10:16.240 --> 10:22.680 to representing these hyperdimensional relations between the data 10:22.680 --> 10:29.720 itself. So being able to quickly recall the degrees of separation between 10:29.720 --> 10:33.960 different pieces of threat data and the documents that they're 10:33.960 --> 10:38.320 assigned to, those were the kind of things that we wanted to represent and 10:38.320 --> 10:42.600 be able to do it quickly. And the third module that I really want to focus on is 10:42.600 --> 10:47.080 the learning part. As I mentioned at the start, this is a machine learning 10:47.080 --> 10:52.040 technique at the end of the day. There needs to be a way of continually 10:52.040 --> 10:59.160 sending feedback into the system. So we wanted to give the analysts, the 10:59.160 --> 11:04.120 experts, the ability to either say yes, a certain pattern is good, or a certain 11:04.120 --> 11:10.080 pattern is not good. This is not a perfect system, so we want to use the 11:10.080 --> 11:15.600 effort that analysts are already putting into analyzing this data and help the 11:15.600 --> 11:23.200 system improve over time. NLP 101, just a quick primer. How many of you are 11:23.200 --> 11:26.120 familiar with natural language processing techniques or have done 11:26.120 --> 11:30.360 some kind of research on it? So there's a few of you here. The primary 11:30.360 --> 11:37.440 purpose is to replicate how humans process language and be able to replicate 11:37.440 --> 11:41.760 the linguistic analysis techniques. There are several different applications of 11:41.760 --> 11:44.520 natural language processing. Natural language processing itself is kind of a 11:44.520 --> 11:49.720 catch-all term. Within the field itself there are several very 11:49.720 --> 11:53.680 specific applications and outputs that you get from these different 11:53.680 --> 11:58.640 applications. We focused on information extraction, and the idea behind 11:58.640 --> 12:03.160 information extraction is to be able to automate the task of identifying, 12:03.160 --> 12:08.280 collecting, and normalizing the relevant information into a structured output. 12:08.280 --> 12:13.840 So information extraction is more than just keywords. It's more than just 12:13.840 --> 12:17.680 string processing, although there are some of those elements involved in 12:17.680 --> 12:22.800 there.
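To make "normalizing the relevant information into a structured output" concrete, here is a minimal sketch of that idea; the regexes and field names below are illustrative stand-ins, not the project's actual patterns or its STIX schema.

```python
import re

# Toy information extraction: pull a few indicator types out of an advisory
# sentence and emit a STIX-like structured record. The regexes and field
# names are illustrative only, not the project's real patterns or schema.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|ru)\b", re.I),
}

def extract_record(sentence: str) -> dict:
    """Return a structured record of indicators found in one sentence."""
    indicators = {name: pat.findall(sentence) for name, pat in IOC_PATTERNS.items()}
    return {
        "source_sentence": sentence,
        "observables": {k: v for k, v in indicators.items() if v},
    }

print(extract_record(
    "The malware beacons to evil-example.com at 203.0.113.7 "
    "and drops a file with MD5 d41d8cd98f00b204e9800998ecf8427e."
))
```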
But we are not going to the point or to the extent of fully understanding 12:22.800 --> 12:29.000 the text the way humans have the capability of doing. So these 12:29.000 --> 12:33.840 applications are very domain specific. Information extraction works well if you 12:33.840 --> 12:38.920 can represent the human expertise within a specific domain the right way. So that 12:38.920 --> 12:47.000 was where a lot of our time and effort was spent. Grade school, anyone? 12:47.000 --> 12:52.560 This is just a quick review of the linguistic analysis that we go through. 12:52.560 --> 12:57.680 Humans do this really easily, very intuitively. Trying to have a 12:57.680 --> 13:01.680 machine do this is an extremely difficult challenge. It is not a 13:01.680 --> 13:06.200 trivial challenge. So as we go through the increasing levels of 13:06.200 --> 13:11.600 analysis, we looked at doing morphological analysis, which is breaking 13:11.600 --> 13:17.360 the words, or a complex word, into its basic form. So if you take for example 13:17.360 --> 13:21.400 the sentence, "The three buffer overflow vulnerabilities could allow remote 13:21.400 --> 13:25.560 code execution," when you look at the word overflow, 13:25.560 --> 13:30.320 break it down into over and flow. Try and get to the basic units. Understanding the 13:30.320 --> 13:34.080 syntax, so are the words correctly placed and what kind of meaning are you getting 13:34.080 --> 13:39.120 out of it? And the third level of analysis, the semantic analysis itself, 13:39.120 --> 13:43.960 which is really understanding what does a word mean within the context of the 13:43.960 --> 13:47.280 sentence. So if you look at the word buffer, there are several meanings. There's the 13:47.280 --> 13:52.880 noun form, the verb form. Within the context of computation, it means 13:52.880 --> 13:58.240 storage, right? It's storage. So being able to get to the right meaning of the word 13:58.240 --> 14:05.120 is extremely important. So as we started looking at and working with our 14:05.120 --> 14:08.720 natural language processing experts, these were the kind of things that we 14:08.720 --> 14:15.080 really focused on. So just a high-level summary of the approach. Now here's where we 14:15.080 --> 14:21.080 really dive into the kind of framework that we set up for extracting the data, 14:21.080 --> 14:29.200 and also we'll talk about the different algorithms that we used for 14:29.200 --> 14:36.040 analyzing the data. So it was a six-step process. The first step, just break the 14:36.040 --> 14:42.040 entire text into its sentences, right? So sentence segmentation. The second step, 14:42.040 --> 14:47.080 which was a core part of our work, was to generate patterns related to each of the 14:47.080 --> 14:53.600 STIX data constructs. So when you look at IOCs, TTPs, courses of action, we 14:53.600 --> 15:00.760 generated base patterns that would be used for the analysis in later stages. 15:00.760 --> 15:05.480 Step three was stemming. This is again an NLP technique to reduce the 15:05.480 --> 15:12.440 complexity in the actual language itself. So stemming essentially boils or 15:12.440 --> 15:17.440 distills different forms of a word down to its basic form. Think hacker, hackers, 15:17.440 --> 15:22.040 hacking, they all boil down to hack. So that's the idea behind stemming, 15:22.040 --> 15:28.200 reducing the complexity.
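A toy sketch of steps one and three as just described, sentence segmentation plus stemming; the hand-rolled suffix stripping below only stands in for the Porter-style stemmer an NLP library would provide, but it shows how hacker, hackers, and hacking all collapse to hack.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Step 1 (toy): break text into sentences on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stem(word: str) -> str:
    # Step 3 (toy): strip common suffixes so "hacker", "hackers",
    # "hacking" all reduce to "hack". A real Porter-style stemmer
    # handles far more cases; this only shows the idea.
    for suffix in ("ers", "er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = ("The three buffer overflow vulnerabilities could allow remote code "
        "execution. Hackers were hacking the exposed service.")
for sentence in split_sentences(text):
    print([stem(w.lower()) for w in re.findall(r"[a-z]+", sentence, re.I)])
```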
Now step four is where you take each sentence and 15:28.200 --> 15:33.920 calculate the similarity score by comparing the sentence against all the 15:33.920 --> 15:40.360 patterns that we had already generated in step two, right? So you 15:40.360 --> 15:44.000 take a sentence, compare it against all the patterns in courses of action. Take a 15:44.000 --> 15:49.720 sentence, compare it against all the patterns within TTPs, IOCs, and so on. 15:49.720 --> 15:54.240 Once you went through that entire process, we'd find the best match in 15:54.240 --> 16:01.000 terms of the similarity score, the pattern set, or the patterns, 16:01.000 --> 16:07.440 that actually matched against that sentence with the highest score. And then 16:07.440 --> 16:14.160 we repeated steps four and five for each STIX data construct. So that was 16:14.160 --> 16:18.960 the basic idea behind this entire process. 16:20.400 --> 16:26.160 So for the pattern generation itself, this is a very crucial 16:26.160 --> 16:30.760 component of information extraction. Garbage in, garbage out. Very true 16:30.760 --> 16:34.920 for information extraction. So it was important that we started out with 16:34.920 --> 16:39.080 at least the right base patterns. We used a couple of different approaches 16:39.080 --> 16:46.360 for this. The first one was a supervised approach. We, actually Elvis, 16:46.360 --> 16:51.320 manually annotated the pattern list that needed to be learned. This is essentially 16:51.320 --> 16:58.440 a classification challenge. So you have a corpus of documents, manually annotate 16:58.440 --> 17:04.360 them, feed them through the classifier, and let the classifier learn, right? So 17:04.360 --> 17:08.280 we call it a supervised approach. And then we also used a semi-supervised 17:08.280 --> 17:14.280 approach. This was used to boost the number of patterns, the base patterns, 17:14.280 --> 17:20.200 that we could generate. The supervised approach is fairly tedious. It does 17:20.200 --> 17:27.320 require a lot of time and effort. So we wanted to boost the actual pattern list. 17:27.320 --> 17:32.360 And we used the semi-supervised approach. The idea here is you create a set of seed 17:32.360 --> 17:37.080 patterns and you feed it documents that are not annotated. And because they're not 17:37.080 --> 17:41.360 annotated, this goes through a bootstrapping process where it keeps learning 17:41.360 --> 17:47.320 over time. There is human intervention required. If you just allow it to learn 17:47.320 --> 17:51.000 on its own and it keeps learning garbage, then you'll end up with garbage 17:51.000 --> 17:58.280 at the end. So another part of our effort was to find the right kind of 17:58.280 --> 18:03.200 base patterns. And let me quickly go to the next slide so Elvis can then really 18:03.200 --> 18:09.000 show you what this looks like in real life. So we went through several 18:09.000 --> 18:16.120 iterations of training and testing. We used documents and advisories from MS-ISAC, 18:16.120 --> 18:23.080 FS-ISAC, and ICS-CERT, ran them through the information extraction process, matched 18:23.080 --> 18:27.160 them against all the learned patterns that we had created, right? So the patterns 18:27.160 --> 18:31.320 that we created in the previous step, matched them against those, and we 18:31.320 --> 18:36.200 calculated three different similarity scores: lexical, semantic, and contextual 18:36.200 --> 18:41.480 similarity. Now each of these has its own advantages and disadvantages.
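Before the breakdown of the three score types, here is a rough sketch of the step-four and step-five loop described above: score every sentence against each STIX construct's pattern set and keep the best match. The Jaccard token overlap and the pattern strings are assumptions used only to show the shape of the loop, not the project's actual similarity functions or patterns.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    # Stand-in similarity: token overlap. The real system used lexical,
    # semantic, and contextual scores rather than plain Jaccard.
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative pattern sets keyed by STIX construct (not the real patterns).
PATTERNS = {
    "course_of_action": ["apply the vendor patch", "disable the affected service"],
    "ttp": ["spear phishing email with malicious attachment"],
    "indicator": ["command and control server at ip address"],
}

def best_construct(sentence: str):
    tokens = set(sentence.lower().split())
    scored = []
    for construct, patterns in PATTERNS.items():
        best = max(jaccard(tokens, set(p.split())) for p in patterns)
        scored.append((best, construct))
    return max(scored)  # (similarity, construct) with the highest score

print(best_construct("Users should apply the vendor patch immediately"))
```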
18:41.480 --> 18:46.320 So the lexical similarity score works well when you're looking for a co-occurring 18:46.320 --> 18:56.600 set of words. So think Siemens and Symantec or any other ICS domain product. 18:56.600 --> 19:01.080 There is a certain set of co-occurring words that you can identify. The 19:01.080 --> 19:05.640 semantic similarity score tries to detect a concept through a word or a 19:05.640 --> 19:12.280 phrase, and then the contextual similarity score analyzes a certain window of words 19:12.280 --> 19:18.120 around a known pattern to try and understand the context. As Elvis goes to 19:18.120 --> 19:22.640 the demo, he'll talk about why we generated the three different scores and 19:22.640 --> 19:29.920 how they were useful in being able to map to the right STIX construct. So 19:29.920 --> 19:35.560 aggregate the score, look at the result, and if the result's good, then make sure 19:35.560 --> 19:39.800 that the pattern is in the pattern list. If not, add it to it. If the score is bad, 19:39.800 --> 19:44.400 try and understand what went wrong, fine-tuning the parameters. This is your 19:44.400 --> 19:49.800 traditional training and testing process that you use for machine learning. So 19:49.800 --> 19:53.640 enough slides. Just want to make sure that everyone's awake. Threat intelligence. 19:53.640 --> 20:01.880 Any Shmooballs? No. All right. So, Elvis. 20:04.000 --> 20:13.920 Can I drop the mic when I'm done? Hi everyone. My name is Elvis Hovor. I have 20:13.920 --> 20:18.240 been with Accenture Technology Labs about three years. I started out of grad 20:18.240 --> 20:23.440 school. I lead some of the development work in threat intelligence down there. 20:23.440 --> 20:52.640 Okay, so let's start the demo if I can find it. 20:53.440 --> 21:01.720 I don't have anything higher than 1440. That's the best I have. 21:01.720 --> 21:25.120 Anyway, is it showing any better now? What can I do? Help here. Reduce it. Zoom out. 21:25.120 --> 21:32.160 Okay. That didn't help. 21:39.040 --> 21:51.280 That good? Thank you. Okay. So I think like Shimon said, initially we started with 21:51.280 --> 21:55.800 trying to think about, you know, how to help analysts really prioritize the 21:55.800 --> 22:01.560 advisories they get on a daily basis. I'm sure that for a bunch of analysts here, you 22:01.560 --> 22:06.800 get documents on the regular that you need to read through every morning. Trying 22:06.800 --> 22:11.440 to figure out which one to read first is a task on its own. You know, which one 22:11.440 --> 22:15.860 has the most important information or relevant information to you is also 22:15.860 --> 22:19.400 another task that you have to go through. Wasting your time trying to figure out 22:19.400 --> 22:23.200 which document to read and all of that. Shimon went through a lot of that. What we 22:23.200 --> 22:27.800 wanted to do with our research project initially started with just trying 22:27.800 --> 22:33.400 to prioritize these documents. But as time went on we decided to wrap a UI 22:33.400 --> 22:38.760 around that to make it easier and more usable for anyone that wanted to use the 22:38.760 --> 22:42.800 tool. So we moved away from just the research portion of it to try and put a 22:42.800 --> 22:50.200 UI on top of it. What you see here is, you know, on a regular day you come in and 22:50.200 --> 22:54.760 load up bulk documents basically. It is placed on this page and it 22:54.760 --> 23:00.080 shows you scores on which one we think should, well, be prioritized the most.
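As a rough illustration of the lexical versus contextual scores described at the top of this section (the semantic score would additionally lean on something like WordNet), here are toy versions of each; the window size, cue words, and example sentence are invented purely for illustration.

```python
def lexical_score(sentence: str, cooccurring: set[str]) -> float:
    # Lexical (toy): what fraction of a known co-occurring word set shows up.
    tokens = set(sentence.lower().split())
    return len(tokens & cooccurring) / len(cooccurring)

def contextual_score(sentence: str, anchor: str, cues: set[str], window: int = 3) -> float:
    # Contextual (toy): look at a window of words around a known anchor
    # pattern and return 1.0 if any cue word appears there, else 0.0
    # (binary, like the contextual extraction described in the demo).
    tokens = sentence.lower().split()
    if anchor not in tokens:
        return 0.0
    i = tokens.index(anchor)
    nearby = set(tokens[max(0, i - window): i + window + 1])
    return 1.0 if nearby & cues else 0.0

s = "the energetic bear group targeted scada systems via a watering hole attack"
print(lexical_score(s, {"scada", "systems", "watering", "hole"}))   # 1.0
print(contextual_score(s, "group", {"bear", "panda", "team"}))      # 1.0
```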
23:00.080 --> 23:04.920 Once again this is a research idea; we give scores because we thought that 23:04.920 --> 23:08.400 those were a good way to score the documents that we have. But, you know, 23:08.400 --> 23:13.200 based on your own preferences you can score it differently and it will show 23:13.200 --> 23:17.000 differently. But you look here and then you see a set of documents and these 23:17.000 --> 23:20.820 are the ones that are prioritized for you to read first and second and third. It 23:20.820 --> 23:25.440 came in on different days. We have a set of data that spans maybe three days. We 23:25.440 --> 23:29.680 wish we had more but it spans about three or four days. That's about how much 23:29.680 --> 23:34.680 we have. I'll try and go through the process with you, loading up documents, how we are 23:34.680 --> 23:38.800 calculating the scores, and what kind of information we are getting out of it. 23:38.800 --> 23:44.440 The data elements we are able to extract out of each document. What extraction 23:44.440 --> 23:50.040 method works better for us and what we realized would work to extract certain 23:50.040 --> 23:54.520 STIX elements or constructs better than other ones. And so as I go through I'll 23:54.520 --> 24:02.920 explain that more. So come here and upload data. First I would upload a file 24:02.920 --> 24:09.720 that has, I guess, the patterns we trained according to that file structure. So it 24:09.720 --> 24:16.120 extracts a lot better if you've trained your pattern set against a specific 24:16.120 --> 24:21.040 document and have trained your modules over time to be able to extract that 24:21.040 --> 24:27.160 data more. So you would see that one works very well and the other works 24:27.160 --> 24:33.120 somewhat okay. So let me try and upload. 24:40.280 --> 24:44.000 Okay this is a file I can upload. 24:44.000 --> 24:58.040 Oh and we were trying to, we're trying to make it a batch file because it takes, 24:58.040 --> 25:01.920 sometimes it takes just a little bit of time to run a single file. So we're 25:01.920 --> 25:05.120 thinking if someone has multiple files that they want to run you want to be 25:05.120 --> 25:08.000 able to make sure that you can make it a batch job for the first one to process 25:08.000 --> 25:12.240 and the second one to come in. But for some reason Java wouldn't allow it, well we 25:12.240 --> 25:17.400 haven't figured out how to have the computer run the Java command on its own 25:17.400 --> 25:21.440 because it's kind of like sandboxed and won't allow us unless we enter the 25:21.440 --> 25:24.920 command. So I have to enter it here, but that gives you an opportunity to see 25:24.920 --> 25:29.640 what's going on in the background. How, you know, it's comparing to the 25:29.640 --> 25:33.440 patterns like Shimon showed earlier on. 25:33.440 --> 25:47.120 So that's the extraction that's going on right now. So if I get back to my page 25:47.120 --> 25:53.040 here and I go to my prioritized list, at the end of the list table you should be 25:53.040 --> 26:11.600 able to see the file that we just entered.
We shorted out a little bit, but 26:11.600 --> 26:16.560 what eventually happens is that when you run it through and the natural language 26:16.560 --> 26:19.240 processing module is running in the back end, it should tell you that it's being 26:19.240 --> 26:24.880 processed, and when it's done, when you refresh your page, you would see it shows 26:24.880 --> 26:27.640 an output that says that it's been processed, and so you can go ahead and 26:27.640 --> 26:30.560 start looking at the file. So if you have a bunch of files this is going to be a 26:30.560 --> 26:39.080 batch process that keeps going through and taking the files through. I have to 26:39.080 --> 26:44.320 shift this a little bit just so I can get to that. So that's it. The file is 26:44.320 --> 26:48.720 uploaded, it's running right now. I'm sure it's done; if I highlight it it should 26:48.720 --> 26:55.640 move. Please pay attention to the file name, I think it's the ICS-CERT one. 27:18.720 --> 27:28.000 Okay yeah thank you. 27:32.120 --> 27:39.640 So now, that means that it found enough information in there to 27:39.640 --> 27:44.480 score it and take it to third in your stack of documents that you have. 27:44.480 --> 27:48.600 I'll tell you a little bit about the scoring as we go through and why the 27:48.600 --> 27:53.160 scoring is that way. It's not perfect yet, but as time goes on I think we would 27:53.160 --> 27:56.320 tweak the scoring a lot more to get some of the information that we want out of 27:56.320 --> 28:00.560 it. So the document will show on one side and the elements 28:00.560 --> 28:04.480 extracted will be on the right hand side. So you see your COAs, all the things 28:04.480 --> 28:13.380 that it was able to extract out of the document you would see from here. So for 28:13.380 --> 28:19.640 COAs we can see that it extracted a bunch of things. The score you see here is 28:19.640 --> 28:25.880 1.0. 1.0 because anything that comes out of contextual similarity extraction, if 28:25.880 --> 28:31.120 we use that method, gives us a binary number, it's either 0 or 1. Either it was able to 28:31.120 --> 28:34.960 extract something or it wasn't. Semantic and lexical are a little different in 28:34.960 --> 28:39.320 that the scores are actually graded and you can tweak them very well. So that is one 28:39.320 --> 28:43.240 part that kind of skews our score a little bit, but what ends up happening is 28:43.240 --> 28:47.600 that even if you don't find anything for the others, or if you only find contextual, which is 28:47.600 --> 28:53.200 one, we decided to average it out for a document. So if the lexical and semantic 28:53.200 --> 28:56.840 similarity methods didn't find anything you would still average it out by three 28:56.840 --> 29:08.040 and it ends up reducing the value of the score. So we were able to find some COAs. 29:08.040 --> 29:14.240 Because of the patterns that we have, sometimes it can extract 29:14.240 --> 29:19.120 certain information that is not necessarily accurate, or it stops extracting before 29:19.120 --> 29:24.320 it's supposed to, instead of extracting the full sentence. Because of this we wanted to 29:24.320 --> 29:27.840 build in, like Shimon said earlier on, a learning module to make sure that 29:27.840 --> 29:34.080 we are correcting this automatically, or you know, the module in the back end 29:34.080 --> 29:39.080 is learning to update itself and delete the patterns that don't work and all of 29:39.080 --> 29:43.640 that.
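A worked example of the averaging just described, assuming the document score is a plain mean of the three method scores: a binary 1.0 from contextual extraction averaged against two empty results still only yields about 0.33, which is the skew being pointed out.

```python
def document_score(lexical: float, semantic: float, contextual: float) -> float:
    # Assumed scoring: a plain average of the three similarity methods.
    return (lexical + semantic + contextual) / 3

# Contextual found a match (binary 1.0) but lexical and semantic found nothing,
# so averaging by three drags the document score down, as described above.
print(document_score(lexical=0.0, semantic=0.0, contextual=1.0))  # 0.333...
```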
So we put the Facebook-style thumbs down there just to, you know, delete 29:43.640 --> 29:49.360 anything that we believe is wrong. So now it's taking all of the how-to's out, 29:49.360 --> 29:54.760 because it was extracting how-to's, and what we built inside that again is the 29:54.760 --> 29:59.580 ability to delete this pattern, whatever pattern is pulling out the how-to, if it 29:59.580 --> 30:03.000 reaches a certain threshold, if it gets a certain number of 30:03.000 --> 30:07.320 thumbs down. So that way the module is learning, and as time goes on we want 30:07.320 --> 30:13.040 to be able to build functionality for analysts to be able to, you know, 30:13.040 --> 30:18.160 highlight a text or highlight a pattern that they think the natural language 30:18.160 --> 30:21.560 processing module is missing and then it's going to add on to the set of 30:21.560 --> 30:28.640 documents that we have. You can see here TTPs and IOCs in here, and they pick up 30:28.640 --> 30:33.520 pretty good ones if the document is really known and the patterns were 30:33.520 --> 30:38.080 tested with that document. I would go in and pick a regular document from CERT 30:38.080 --> 30:42.800 and upload that and see what the difference is; you would see that some of 30:42.800 --> 30:46.760 them would not be as great as what we have with the documents that have 30:46.760 --> 31:01.200 been tested already. So what, first one? 31:16.760 --> 31:38.640 Goodness. Okay, TXT. Upload it again. 31:46.760 --> 32:02.640 Once again we should be able to see it being processed, and when we highlight it, 32:02.640 --> 32:07.120 when it's done, we should be able to get into the file and check that out. So 32:07.120 --> 32:10.680 whilst that is running, these are some 32:10.680 --> 32:15.480 of the findings that we saw as the research project went on. We were hoping 32:15.480 --> 32:20.880 that we could use one of the similarity methods, the lexical, contextual, or 32:20.880 --> 32:25.080 semantic similarity methods, to extract the information that we needed, but 32:25.080 --> 32:32.480 we realized that each one has its own advantage, so it would not be 32:32.480 --> 32:37.440 wise for us to just choose one and leave the others, because we saw 32:37.440 --> 32:42.120 that maybe contextual similarity was able to pull things like threat actors, 32:42.120 --> 32:45.400 the things that have very specific elements that need to be pulled out, like 32:45.400 --> 32:51.440 a word or, you know, a very small phrase, a phrase that isn't too long, and then for 32:51.440 --> 32:57.720 things like TTPs that sometimes can span multiple lines, we saw that using things 32:57.720 --> 33:00.840 like semantic or lexical similarity methods would pull those out better for 33:00.840 --> 33:05.080 us, so we ended up using an aggregate of all three of these to be able to get 33:05.080 --> 33:15.720 the information that we needed. So I believe it is done and we can check that out. 33:34.240 --> 33:38.320 That's it. 33:38.320 --> 33:54.760 So out of this document, it was able to pull a bunch of exploit targets for us. 33:54.760 --> 34:00.960 As you can see, sometimes it can grab the sentence that you want and not necessarily 34:00.960 --> 34:02.160 give you the element that you want. 34:02.160 --> 34:04.760 It might grab it before, it might grab it after.
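A minimal sketch of the thumbs-down learning loop being described: each downvote counts against the pattern that produced the extraction, and the pattern is retired once it crosses a threshold. The threshold value and the in-memory storage are assumptions for illustration only.

```python
from collections import Counter

class PatternFeedback:
    """Retire extraction patterns that analysts repeatedly thumb down."""

    def __init__(self, patterns: set[str], threshold: int = 3):
        self.patterns = set(patterns)
        self.threshold = threshold          # assumed value, not the project's
        self.downvotes = Counter()

    def thumbs_down(self, pattern: str) -> None:
        self.downvotes[pattern] += 1
        if self.downvotes[pattern] >= self.threshold:
            self.patterns.discard(pattern)  # stop using the bad pattern

feedback = PatternFeedback({"how to", "apply the patch"})
for _ in range(3):
    feedback.thumbs_down("how to")          # analysts reject the "how-to" noise
print(feedback.patterns)                    # {'apply the patch'}
```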
34:04.760 --> 34:11.520 It's just, I guess, the patterns and then having to get the right pattern match for 34:11.520 --> 34:12.520 each document. 34:12.520 --> 34:16.200 The thing with natural language processing, and I think it's natural language processing 34:16.200 --> 34:20.480 as a whole, it's still being developed to get to that fine-grained point where it can 34:20.480 --> 34:23.360 pull very specific information. 34:23.360 --> 34:29.400 And so as much as we were trying to get it, and this is also, I guess, a project that 34:29.400 --> 34:34.020 we are still building on and trying to get to the next step, we do have a few things 34:34.020 --> 34:38.040 that need to be tweaked here and there, but it is extracting some information. 34:38.040 --> 34:44.320 So if you, as an analyst, can get to move away from doing all of these things and having 34:44.320 --> 34:47.880 just a document moving up and telling you that I have this amount of information on 34:47.880 --> 34:51.600 there that is related to threat information, I think you should look at that and see if 34:51.600 --> 34:53.200 that applies to you. 34:53.200 --> 34:56.120 It's going to make your job a little easier. 34:56.120 --> 35:02.080 So after we were done with putting this information in a more structured format, which is the 35:02.080 --> 35:07.720 STIX that we have here, and I think there's a caveat to putting it in a structured format 35:07.720 --> 35:12.360 in STIX, and I'll put that in our challenges at the end, but we wanted to do the higher 35:12.360 --> 35:16.280 order analysis that I think Shimon spoke about early on. 35:16.280 --> 35:21.760 I mean, if we have the information in a more structured format, there's so much we can 35:21.760 --> 35:22.760 do with it. 35:22.760 --> 35:28.560 So we decided to take some examples and do some D3 visualizations on top of that, of 35:28.560 --> 35:32.160 the data that we have, to see what we can pull out and what we can do with it. 35:32.160 --> 35:34.320 Like we said, it's just examples. 35:34.320 --> 35:35.320 We had a graph database. 35:35.320 --> 35:39.080 We were pulling the information out and trying to see what we could relate together 35:39.080 --> 35:45.960 and the kind of information that we can pull out of it now that it's more structured. 35:45.960 --> 35:47.800 So we decided to use this tree map. 35:47.800 --> 35:55.440 This is one of the first ones that we have. 35:55.440 --> 36:01.920 To show the information that we have in there, and thanks to Sean, the D3 guy, he helped us a 36:01.920 --> 36:04.200 lot on this. 36:04.200 --> 36:11.000 So we decided that if we could use a tree map to see what information is contained in 36:11.000 --> 36:14.000 the documents, then you can have a general idea. 36:14.000 --> 36:19.080 If you limit the documents that you have to maybe a one-week set of documents, then 36:19.080 --> 36:29.520 you want to see what has shown up a lot in the advisories that you've collected. 36:29.520 --> 36:36.960 I think one thing that I've forgotten to say is that as analysts, or as the threat data 36:36.960 --> 36:43.520 team, or the threat analyst team, you collect documents that are usually related to your 36:43.520 --> 36:44.520 organization. 36:44.520 --> 36:47.720 So some of this information can get very specific to your organization because you're collecting 36:47.720 --> 36:53.440 data that's around the kind of operations that you guys run in your organizations.
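The tree map view described here boils down to counting how often each extracted entity appears in the documents from a chosen window; a toy aggregation like the following (dates and entities invented for illustration) is enough to feed a D3 treemap.

```python
from collections import Counter
from datetime import date, timedelta

# Invented example data: (document date, construct, extracted value).
extractions = [
    (date(2015, 1, 12), "threat_actor", "energetic bear"),
    (date(2015, 1, 13), "threat_actor", "energetic bear"),
    (date(2015, 1, 14), "threat_actor", "sandworm"),
    (date(2015, 1, 14), "coa", "apply the vendor patch"),
]

def counts_for_window(rows, construct, days=7, today=date(2015, 1, 15)):
    # Count how often each value of one construct appeared in the last N days,
    # e.g. which threat actors showed up a lot in this week's advisories.
    cutoff = today - timedelta(days=days)
    return Counter(v for d, c, v in rows if c == construct and d >= cutoff)

print(counts_for_window(extractions, "threat_actor"))
```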
36:53.440 --> 36:59.160 So you can come here and just look for maybe threat actors and see in the last week 36:59.160 --> 37:04.080 which threat actors have shown up a lot in the documents that I have, or in the last 37:04.080 --> 37:10.520 day, maybe last month, which one is coming up a lot and why should I be worried about 37:10.520 --> 37:11.520 that. 37:11.520 --> 37:16.640 And then you can do the same for COAs, go into your COAs and see what kind of COAs are 37:16.640 --> 37:17.640 out there. 37:17.640 --> 37:23.560 And the COAs are worded as sentences, or very long sentences, so trying to display that in 37:23.560 --> 37:28.560 D3 was a little bit of a hassle too. 37:28.560 --> 37:30.960 We're still trying to figure that out, like we said. 37:30.960 --> 37:40.080 So we want to be able to give the user the ability to see what the whole COA 37:40.080 --> 37:41.080 is. 37:41.080 --> 37:44.840 When I say COA here, I mean courses of action. 37:44.840 --> 37:45.840 I'm sorry. 37:45.840 --> 37:51.560 I'm using a lot of acronyms, am I not? 37:51.560 --> 37:53.560 Anyway. 37:53.560 --> 37:57.320 The other visualization we decided to use was historical analysis. 37:57.320 --> 38:05.960 Here we wanted to either give the analyst the ability to widen the net or tighten the 38:05.960 --> 38:06.960 net some more. 38:06.960 --> 38:10.760 You have documents, and I think one of the problems that we were talking about earlier 38:10.760 --> 38:16.280 on is that as an analyst, you, or as a human trying to analyze these documents, you 38:16.280 --> 38:21.960 may have seen some element in there, maybe a threat actor or a TTP that you were concerned 38:21.960 --> 38:26.320 about, but it was way back, let's say four weeks back or a month back, and you don't 38:26.320 --> 38:28.480 remember which document it was anymore. 38:28.480 --> 38:32.520 You never did anything about it, but all of a sudden you see that picking up again and 38:32.520 --> 38:36.120 you want to be able to find those documents that are related to that document that you 38:36.120 --> 38:38.280 just saw, but there's no way you can remember it. 38:38.280 --> 38:41.520 You don't even know sometimes if you ever came across that. 38:41.520 --> 38:47.400 We wanted to show that information here and make sure that we give you a historical view 38:47.400 --> 38:56.080 of the kind of documents that you have in your repository or your database. 38:56.080 --> 39:01.360 What you see here is a list of the documents that we have in here. 39:01.360 --> 39:04.280 It's a limited list that we have. 39:04.280 --> 39:08.920 Anytime you mouse over a document, let's say you just put a new document in, it shows up. 39:08.920 --> 39:14.760 Anytime you mouse over it, you see the documents that are connected to that document in any 39:14.760 --> 39:15.760 way. 39:15.760 --> 39:18.960 If it's a set of threat actors that are connected, it's going to show you a different color. 39:18.960 --> 39:22.320 If it's a set of IOCs that are connected, it's going to show you a different color. 39:22.320 --> 39:27.280 It just gives you an idea of how your documents are connected together and which ones you 39:27.280 --> 39:32.120 should be paying attention to if you ever want to go back and read or look at some of 39:32.120 --> 39:33.920 the information that you have in there. 39:33.920 --> 39:39.360 Like we said, this is all going into our Neo4j database. 39:39.360 --> 39:42.400 So Neo4j has a frontend query interface.
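To illustrate the kind of relationship lookup the graph store makes cheap, here is a standalone Python sketch rather than an actual Neo4j Cypher query, with an invented document and entity layout: find the documents connected to a given one through shared threat actors or IOCs.

```python
from collections import defaultdict

# Invented layout: which extracted entities appear in which documents.
doc_entities = {
    "advisory-001": {("threat_actor", "energetic bear"), ("ioc", "203.0.113.7")},
    "advisory-002": {("threat_actor", "energetic bear")},
    "advisory-003": {("ioc", "203.0.113.7"), ("ioc", "evil-example.com")},
    "advisory-004": {("coa", "apply the vendor patch")},
}

def connected_documents(doc_id: str):
    # Documents that share at least one extracted entity with doc_id,
    # grouped by the entity type that links them (actor, IOC, ...).
    links = defaultdict(set)
    for kind, value in doc_entities[doc_id]:
        for other, entities in doc_entities.items():
            if other != doc_id and (kind, value) in entities:
                links[kind].add(other)
    return dict(links)

print(connected_documents("advisory-001"))
# {'threat_actor': {'advisory-002'}, 'ioc': {'advisory-003'}}
```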
39:42.400 --> 39:46.320 You can go in there and query the data that you want and do it whichever way you want. 39:46.320 --> 39:47.320 This is D3. 39:47.320 --> 39:48.440 It's all open source. 39:48.440 --> 39:52.000 You can decide to use whichever visualization you want to use on top of this data to pull 39:52.000 --> 39:54.600 whatever you feel is most necessary for you. 39:54.600 --> 39:59.200 These were just examples we wanted to show. 39:59.200 --> 40:10.080 And so some of the challenges that I guess we had going into this, and we realized were 40:10.080 --> 40:13.560 going to be a hindrance for us, is that, you know, STIX. 40:13.560 --> 40:14.560 We wanted to put it in STIX. 40:14.560 --> 40:15.560 We said STIX. 40:15.560 --> 40:17.640 We wanted to make sure that everything is in STIX. 40:17.640 --> 40:19.000 It's more structured. 40:19.000 --> 40:21.520 But STIX is very expensive. 40:21.520 --> 40:27.040 And the data or the elements that you need to extract into STIX can get very, very, 40:27.040 --> 40:32.480 very granular, and you need to make sure that you capture that out of a sentence, or you 40:32.480 --> 40:36.720 capture that without missing the meaning of that in the sentence. 40:36.720 --> 40:38.000 And it became very difficult. 40:38.000 --> 40:41.840 We tried with natural language techniques to get granular and we saw that we were missing 40:41.840 --> 40:43.720 a lot of information that we needed. 40:43.720 --> 40:47.600 So we decided to expand it a little more and go for the broader STIX constructs, which 40:47.600 --> 40:53.120 is, you know, IOCs, threat actors, things of that sort. 40:53.120 --> 40:54.120 It's broader. 40:54.120 --> 40:55.680 That way we can capture a lot more. 40:55.680 --> 41:00.800 You might have to capture it in a sentence instead of capturing the element, but it still 41:00.800 --> 41:03.240 gives you a better idea of what's in there. 41:03.240 --> 41:07.600 And as time goes on and as this is developed and as people, you know, put a lot more effort 41:07.600 --> 41:12.160 into it, we believe that it's going to get to that point where the natural language processing 41:12.160 --> 41:17.440 engine will be able to go directly and look for things like affected systems and just 41:17.440 --> 41:21.720 pull that affected system out and put it in a very, very specific place that a machine 41:21.720 --> 41:22.720 can easily use. 41:22.720 --> 41:29.720 Do you have anything? 41:29.720 --> 41:34.720 Okay. 41:34.720 --> 41:53.480 Yeah, so that was one of them, and also I think this is one of the most basic NLP problems 41:53.480 --> 42:00.000 around, having one word that means something else in a different sentence. 42:00.000 --> 42:01.400 So it's difficult. 42:01.400 --> 42:07.680 As humans, we are easily able to kind of make those out, but for a computer, it's a lot 42:07.680 --> 42:13.200 more difficult for the computer to make out that, you know, maybe, what? 42:13.200 --> 42:17.240 Give me an example. 42:17.240 --> 42:19.320 Maybe attacker, right? 42:19.320 --> 42:23.160 In one sentence means that that's the person that actually committed a crime. 42:23.160 --> 42:26.120 It's a threat actor. 42:26.120 --> 42:31.240 And its simple form, attacker's simple form, is actually attack, right?
42:31.240 --> 42:35.240 And you know, for the computer, it sees it, it tries to break it down into all of its 42:35.240 --> 42:40.640 simplest forms for natural language processing, and all of a sudden that attack just means 42:40.640 --> 42:44.320 it's trying to explain something else in the middle of a sentence somewhere in a TTP. 42:44.320 --> 42:48.000 It ends up pulling the same attack just like it's trying to pull attacker. 42:48.000 --> 42:50.360 So things like that made it very difficult. 42:50.360 --> 42:55.120 These are underlying NLP problems, and I think that as we go forward and we put a little 42:55.120 --> 43:00.640 more effort into it as a community, it's going to improve and this is going to help in some 43:00.640 --> 43:03.160 of the work that we're trying to do. 43:03.160 --> 43:04.160 All right. 43:04.160 --> 43:09.080 Let me just take it home. 43:09.080 --> 43:12.960 So yeah, in terms of what are we trying to do moving forward, right? 43:12.960 --> 43:17.760 Like we talked a little bit about some of the challenges that we face, some of the results 43:17.760 --> 43:19.960 that work, some of the patterns that work. 43:19.960 --> 43:24.320 One of the first things that we want to do is extend the functionality of tagging 43:24.320 --> 43:26.360 new patterns from the screens you saw, right? 43:26.360 --> 43:32.680 So that really puts the power of the tool in the hands of the analyst. 43:32.680 --> 43:35.560 There's a lot of fine-tuning to be done with the weighting parameters. 43:35.560 --> 43:41.080 As we go forward, one of the things that we found when we were doing the analysis is that 43:41.080 --> 43:47.280 threat actors, and I was talking about threat actors and TTPs, were getting really accurate 43:47.280 --> 43:52.800 results using contextual similarity, but then there are highly specific elements within indicators 43:52.800 --> 43:58.720 of compromise, within observables, that are better suited for lexical and semantic analysis. 43:58.720 --> 44:04.280 So we are looking at those components as we go through the iterative process of making 44:04.280 --> 44:09.200 this solution and the underlying engine better. 44:09.200 --> 44:11.600 Expand the pattern sets for different types of documents. 44:11.600 --> 44:17.320 Right now we are heavily focused on US-CERT, ICS-CERT, and MS-ISAC. 44:17.320 --> 44:22.520 Those documents tend to have a lot more vulnerabilities, courses of action. 44:22.520 --> 44:25.440 Sometimes they might have threat actor and campaign names. 44:25.440 --> 44:26.520 That's unusual. 44:26.520 --> 44:34.640 So we want to expand the actual corpus to include that spectrum of information. 44:34.640 --> 44:38.440 And one of the next things that the natural language processing team is really going to 44:38.440 --> 44:43.120 focus on is being able to incorporate named entity recognition. 44:43.120 --> 44:48.440 So being able to identify references to known names and entities in the text themselves. 44:48.440 --> 44:55.040 This can be really useful if people do follow a common standard of naming campaigns and 44:55.040 --> 44:56.040 threat actors. 44:56.040 --> 45:01.800 These can be used for named entity recognition techniques. 45:01.800 --> 45:08.840 So everything that we've used here is based off open source technology. 45:08.840 --> 45:14.520 And if you do want to tinker, and I believe quite a lot of you here are hobbyists, if 45:14.520 --> 45:18.600 you want to build your own stack, where do you go about putting this together?
45:18.600 --> 45:19.880 There aren't a lot of NLP people here. 45:19.880 --> 45:26.800 We had to go about trying to rely on the knowledge of our team to really understand how you 45:26.800 --> 45:28.480 get to put this whole stack together. 45:28.480 --> 45:32.320 So Apache OpenNLP, definitely look that up. 45:32.320 --> 45:38.120 Excellent resource of both implemented algorithms and documentation. 45:38.120 --> 45:43.600 They have all the underlying technologies that you need for parsing the sentences, applying 45:43.600 --> 45:49.840 stemming, doing any kind of part-of-speech tagging, any kind of basic parsing. 45:49.840 --> 45:55.520 They have a lot of great implemented algorithms already out there. 45:55.520 --> 46:00.960 The Stanford Natural Language Processing Group, again, a great wealth of information 46:00.960 --> 46:01.960 over there. 46:01.960 --> 46:09.760 They also have released a lot of their algorithms on their website. 46:09.760 --> 46:14.800 And WordNet, it's essentially a lexical database of English. 46:14.800 --> 46:20.480 So whenever you're trying to do similarity scores between words that could have the same 46:20.480 --> 46:25.960 meaning or could be related, WordNet is the world's largest lexical database. 46:25.960 --> 46:30.960 So definitely that's a core component of what we are using. 46:30.960 --> 46:36.960 And then the TextRank algorithm, again, this is a core part of how we are doing the lexical 46:36.960 --> 46:41.840 similarity matching, or how we intend on implementing lexical similarity matching. 46:41.840 --> 46:43.800 All of this is open source. 46:43.800 --> 46:47.680 It's available out there for you to go download. 46:47.680 --> 46:52.020 And one of the things that we want to do over the next, in the near future, is we'll be 46:52.020 --> 46:59.760 setting up, we're looking to set up, a GitHub repository and at least start a flow of information 46:59.760 --> 47:05.560 of our learnings into the open community. 47:05.560 --> 47:09.680 So yeah, with that, I think we still have a few minutes for questions. 47:09.680 --> 47:13.320 I'm sure you guys have a few. 47:13.320 --> 47:15.520 And thanks for your time. 47:15.520 --> 47:20.760 And if you do want to follow us, you can follow me on Twitter. 47:20.760 --> 47:21.760 Hit me up with any questions. 47:21.760 --> 47:27.160 If you want to take anything offline, we'll be happy to start a conversation. 47:27.160 --> 47:36.000 But with that, with whatever little time that we have left, we'd be happy to take any questions. 47:36.000 --> 47:58.240 Thank you.
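For anyone who wants to poke at the WordNet piece of that stack, here is a small sketch using NLTK's WordNet interface (it assumes `pip install nltk` plus a one-time `nltk.download('wordnet')`), reusing the "buffer" example from earlier in the talk.

```python
# Requires: pip install nltk, then a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# "buffer" has several senses; list them the way an analyst (or a semantic
# similarity score) has to disambiguate them.
for synset in wn.synsets("buffer", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# A crude relatedness check of the kind a semantic similarity score relies on:
# take the best path similarity between any sense of "buffer" and "storage".
storage_senses = wn.synsets("storage", pos=wn.NOUN)
best = max(
    b.path_similarity(s) or 0.0
    for b in wn.synsets("buffer", pos=wn.NOUN)
    for s in storage_senses
)
print(best)
```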